EXPLORATORY DATA ANALYSIS

1. Data preparation

The data were downloaded from the website, and basic cleaning was done in the preparation step. Due to the size of the dataset (2.5 GB for 3 months of data), a sample is used in order to explore the data more efficiently.

The preparation of the data includes:

  • Remove the N/A values of the data
  • After checking each variable and its meaning, keep only the values within its logical range:
    • fare > 0
    • trip distance > 0
    • extra charges >= 0
    • tolls >= 0
    • total amount > 0
    • improvement surcharge == 0.3
    • passengers >= 0
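A minimal sketch of this preparation step in R; the raw file name 'data/raw_taxi_data.csv' is an assumption, and the column names follow the data dictionary shown in the summary later in the report:

```r
## hypothetical preparation step: clean the raw data, then sample it
raw <- read.csv("data/raw_taxi_data.csv")
raw <- na.omit(raw)                       # remove the N/A values

raw <- subset(raw,
              fare_amount > 0 &
              trip_distance > 0 &
              extra >= 0 &
              tolls_amount >= 0 &
              total_amount > 0 &
              improvement_surcharge == 0.3 &
              passenger_count >= 0)

## sample to keep the exploration manageable (2.5 GB for 3 months)
set.seed(1)
taxi_sample <- raw[sample(nrow(raw), 5e4), ]
write.csv(taxi_sample, "data/taxi_sample_data.csv", row.names=FALSE)
```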

Load the libraries.

library(dplyr)
library(reshape2)
library(magrittr)
library(latticeExtra)
library(zoo)
library(ggplot2)
library(raster)
library(sf)
library(RColorBrewer)
library(lubridate)

1. Read the taxi data

The sample data produced by the preparation step, which includes the basic cleaning described above, is read in.

data  <-  read.csv("data/taxi_sample_data.csv")

2. Load the taxi zone data

Besides the trip records, the website provides a .csv file with the taxi zones of New York.

zones <-  read.csv("data/taxi_zones.csv")

3. Plot the area of the analysis

TLC zones
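The spZones object used below is not created in this report; it can be loaded from the TLC zones shapefile with sf (the shapefile path here is an assumption):

```r
## read the taxi-zone polygons (the path is an assumption)
spZones <- st_read("data/taxi_zones/taxi_zones.shp")
```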

plot(spZones["zone"])

From the shapefile we also have information about the boroughs:

borough <-  ggplot() + geom_sf(data=spZones, aes(fill = borough))
borough

The .csv also contains information about the kind of service zone:

a <- merge(spZones, zones[,c(1,4)], by='LocationID')
service <-  ggplot() + geom_sf(data=a, aes(fill = service_zone))
service

4. Make a summary of the dataset for a first analysis.

summary(data)
##     VendorID              tpep_pickup_datetime         tpep_dropoff_datetime
##  Min.   :1.000   2017-06-28 20:31:05:    3     2017-03-18 03:32:29:    3    
##  1st Qu.:1.000   2017-11-27 20:08:58:    3     2017-06-05 19:30:40:    3    
##  Median :2.000   2017-03-01 18:04:24:    2     2017-03-01 18:11:50:    2    
##  Mean   :1.549   2017-03-01 19:07:32:    2     2017-03-01 19:20:12:    2    
##  3rd Qu.:2.000   2017-03-01 19:38:59:    2     2017-03-01 19:33:17:    2    
##  Max.   :2.000   2017-03-01 21:44:09:    2     2017-03-01 20:13:47:    2    
##                  (Other)            :43751     (Other)            :43751    
##  passenger_count trip_distance      RatecodeID    store_and_fwd_flag
##  Min.   :1.000   Min.   : 0.510   Min.   :1.000   N:43598           
##  1st Qu.:1.000   1st Qu.: 1.200   1st Qu.:1.000   Y:  167           
##  Median :1.000   Median : 1.900   Median :1.000                     
##  Mean   :1.619   Mean   : 2.942   Mean   :1.002                     
##  3rd Qu.:2.000   3rd Qu.: 3.500   3rd Qu.:1.000                     
##  Max.   :6.000   Max.   :52.200   Max.   :4.000                     
##                                                                     
##   PULocationID    DOLocationID    payment_type  fare_amount    
##  Min.   :  4.0   Min.   :  3.0   Min.   :1     Min.   :  2.50  
##  1st Qu.:113.0   1st Qu.:100.0   1st Qu.:1     1st Qu.:  7.00  
##  Median :161.0   Median :161.0   Median :1     Median : 10.00  
##  Mean   :161.5   Mean   :158.4   Mean   :1     Mean   : 12.71  
##  3rd Qu.:231.0   3rd Qu.:234.0   3rd Qu.:1     3rd Qu.: 15.00  
##  Max.   :265.0   Max.   :265.0   Max.   :1     Max.   :245.00  
##                                                                
##      extra           mta_tax      tip_amount       tolls_amount    
##  Min.   :0.5000   Min.   :0.5   Min.   :  0.010   Min.   : 0.0000  
##  1st Qu.:0.5000   1st Qu.:0.5   1st Qu.:  1.550   1st Qu.: 0.0000  
##  Median :0.5000   Median :0.5   Median :  2.150   Median : 0.0000  
##  Mean   :0.6649   Mean   :0.5   Mean   :  2.765   Mean   : 0.2123  
##  3rd Qu.:1.0000   3rd Qu.:0.5   3rd Qu.:  3.200   3rd Qu.: 0.0000  
##  Max.   :1.0000   Max.   :0.5   Max.   :425.470   Max.   :37.7600  
##                                                                    
##  improvement_surcharge  total_amount   
##  Min.   :0.3           Min.   :  5.31  
##  1st Qu.:0.3           1st Qu.: 10.30  
##  Median :0.3           Median : 13.56  
##  Mean   :0.3           Mean   : 17.16  
##  3rd Qu.:0.3           3rd Qu.: 19.80  
##  Max.   :0.3           Max.   :445.27  
## 

The main characteristics seen in the summary concern the numeric variables, whose maximum values are much higher than the mean or the median (total amount, tip amount, tolls, fare amount, etc.). This points to outliers or skewed distributions in the data.

2. Numeric variables

We will now explore the numeric variables and look for the main characteristics and data distributions.

# extract from the dataset the numeric variables
df  <-  data[, c(5,11,14,15,17)]
df  <-  melt(df)
## No id variables; using all as measure variables
# Theme characteristics for plots.
myTheme  <- custom.theme(pch=20, cex=0.5)
myTheme$strip.background$col  <- "white"

# Frequency of data in each variable
histogram(~value|variable, data=df, par.settings=myTheme, xlab=" ", scales='free')

Boxplots also help to see the data distribution. Some records present very high values for the numeric variables, so the distributions do not seem to be normal.

ggplot(df, aes(x=variable,y=value)) +
    geom_boxplot(aes(fill=value),outlier.size = 0.1) +
    labs(x=" ", y=" ") +
     facet_wrap(~ variable, scales='free')

We add new conditions for the numeric variables, since the total, fare and tip amounts have extremely large values.

We also find that some records have fares lower than the tip; those records are removed.

Additionally, the tip should be at least 10% of the fare, and we only consider tips below 30% of the fare. The goal is to recommend a tip depending on other variables, and a tip above 30% cannot reasonably be recommended.

Remove records where the fare is lower than the tip, as well as tips below 10% or above 30% of the fare:

data  <- data[data$fare_amount > data$tip_amount,]
data  <-  data[data$tip_amount > 0.1*data$fare_amount,]
data  <-  data[data$tip_amount < 0.3*data$fare_amount,] 

histogram(data$tip_amount, par.settings=myTheme, xlab="tip amount")

We can look at the tip amount as a percentage of the fare amount.

histogram((data$tip_amount/data$fare_amount)*100, par.settings=myTheme, xlab="tip (% of fare)")

Most tips are between 20% and 25% of the fare amount.
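This claim can be checked numerically with the quartiles of the tip percentage (a quick sketch using the columns already present in the data):

```r
## quartiles of the tip as a percentage of the fare
tipPer <- data$tip_amount / data$fare_amount * 100
quantile(tipPer, probs=c(0.25, 0.5, 0.75))
```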

1. Trip time duration

Another numeric variable that could be relevant is the duration of each trip. We can calculate this variable from the pick up and drop off time.

The timestamps are loaded as factors. Convert them to date-time objects:

data$tpep_pickup_datetime <- as.POSIXct(as.character(data$tpep_pickup_datetime), tz = "GMT", format="%Y-%m-%d %H:%M:%S")
data$tpep_dropoff_datetime <- as.POSIXct(as.character(data$tpep_dropoff_datetime), tz = "GMT",format="%Y-%m-%d %H:%M:%S")

## transform to NY time zone

data$tpep_pickup_datetime <- with_tz(data$tpep_pickup_datetime, tzone = "America/New_York")
data$tpep_dropoff_datetime <- with_tz(data$tpep_dropoff_datetime, tzone = "America/New_York")

## new column with the differences:

## vectorised difference in minutes (instead of a slow row-wise apply)
data$trip_timelength  <- as.numeric(difftime(data$tpep_dropoff_datetime,
                                             data$tpep_pickup_datetime,
                                             units="mins"))

# Plot the new variable
histogram(data$trip_timelength, par.settings=myTheme, xlab="trip time duration", scales='free')

The data contain negative values, which should be removed; we keep only trips longer than five minutes:

data  <- data[data[,"trip_timelength" ] > 5,]
df  <-  data[, c(5,11,14,15,17,18)]
df  <-  melt(df)

histogram(data$trip_timelength, par.settings=myTheme, xlab="trip time duration", scales='free')

We can now plot all the numeric variables together again: the largest fare amounts have been removed along with the negative trip durations. The boxplots show the general statistics of the data, with the median value and the quantile range; we can also add the density in a violin plot, which helps to see the shape of distributions that depart from normality.

# extract from the dataset the numeric variables
df  <-  data[, c(5,11,14,17,18)]
df$tipPer <- df[,3]/df[,2]*100 
df  <-  melt(df)
## No id variables; using all as measure variables
ggplot(df, aes(x=variable,y=value)) +
    geom_violin(trim=FALSE)+  
    geom_boxplot(width=0.2, outlier.shape=NA, aes(fill=value)) +
    labs(x=" ", y=" ") +
     facet_wrap(~ variable, scales='free')

2. Numerical variables relationships

We can look at the scatterplots of the numerical variables: a linear relationship is seen, but with some stratified characteristics. Looking at the tip amount at the centre of the scatterplot matrix, there are some constant ‘tip amount’ values independent of the fare amount.

pairs(data[,c(5,11,14,17,18)], cex=0.3)

Once the general linear relationship is seen, we can remove some of the outliers.

## First, calculate the quantiles and the IQR in the numeric variables:

quant  <-  function(x) {
    q  <-  quantile(x, probs=c(.25 ,.75))
    iqr  <- IQR(x)
    
    up  <-  q[2] + 1.5 *iqr
    down  <- q[1] - 1.5 *iqr

    c  <- c(up,down)
    c
}

## apply this function to the 5 relevant numeric variables:

var  <-  list("trip_distance","fare_amount","tip_amount","tolls_amount","total_amount")
q  <- lapply(var, FUN=function(x) quant(data[,x]))
names(q) <- var

## filter the dataset with these conditions:

subsetq <-  function(x) {
    ## filter cumulatively, so that every variable's condition is applied
    for (i in 1:length(var)) {
        x  <- subset(x, x[[var[[i]]]] < q[[i]][1] & x[[var[[i]]]] > q[[i]][2])
    }
    x
}

data  <- subsetq(data)

If we represent again the data:

# extract from the dataset the numeric variables
df  <-  data[, c(5,11,14,17,18)]
df$tipPer <- df[,3]/df[,2]*100 
df  <-  melt(df)
## No id variables; using all as measure variables
ggplot(df, aes(x=variable,y=value)) +
    geom_violin(trim=FALSE)+  
    geom_boxplot(width=0.2, outlier.shape=NA, aes(fill=value)) +
    labs(x=" ", y=" ") +
     facet_wrap(~ variable, scales='free')

3. Hour of pick up or drop off

Another potentially relevant variable is the hour at which the taxi is picked up or dropped off. Create new columns with the hour of pick up and drop off.

data$t_pu <- hour(data$tpep_pickup_datetime)
data$t_do <- hour(data$tpep_dropoff_datetime)

Plot the frequency of records by hour of pick up.

df  <- data[, c("tip_amount","t_pu", "t_do")]
barchart(table(data$t_pu), par.settings=myTheme, xlab=" ", horizontal=FALSE)

There are few records between 2 and 9 a.m. compared to the rest of the day: fewer taxi trips are made overnight. We check whether the tip differs depending on the pick-up hour:

df <- df %>% 
group_by(t_pu) %>%
    summarise(m = median(tip_amount))%>%
    as.data.frame()
 
dotplot(m~as.factor(t_pu), data=df, pch=20, grid=TRUE, cex=2, ylab='median tip', xlab='time', par.settings=myTheme)

Between 0 and 8 in the morning the median tip amount behaves differently, but we should be careful: this is the period with the fewest records, and the conclusions might not be representative.

The median tip (%) is represented for each hour. Again, it is difficult to draw conclusions due to the lack of data between 0 and 9 h, but the highest value is found at 9 a.m., with a trend towards higher values between 11 and 12.

df  <- data[ ,c("t_pu", "t_do")]
df$tipPer <- data[,14]/data[,11]*100


df <- df %>% 
group_by(t_pu) %>%
    summarise(m = median(tipPer))%>%
    as.data.frame()
 
dotplot(m~as.factor(t_pu), data=df, pch=20, grid=TRUE, cex=2, ylab='median % tip', xlab='time', par.settings=myTheme)

3. Categorical variables

After examining the numeric variables of the dataset, we turn to its categorical variables.

  • Payment type: only card payments are found
which(data$payment_type != 1) # records of payments different than 1
## integer(0)
  • RateCodeID: there is only code number 1
histogram(data$RatecodeID, par.settings=myTheme)

which(data$RatecodeID != 1)
## integer(0)

1. Pick up and drop off location ID:

From the zones .csv there is information about the zones LocationID, Borough and service zone. We create 2 columns in the dataframe with the PU (pick up) and DO (drop off) service and borough zone.

LocationID takes 265 values, but the zones file only has information about 263 of them. In addition, the shapefile has no information on these zones, so we remove those records before continuing:

data <- data[data$PULocationID <= 263,]
data <- data[data$DOLocationID <= 263,]
tail(zones)
##     LocationID   Borough               Zone service_zone
## 260        260    Queens           Woodside    Boro Zone
## 261        261 Manhattan World Trade Center  Yellow Zone
## 262        262 Manhattan     Yorkville East  Yellow Zone
## 263        263 Manhattan     Yorkville West  Yellow Zone
## 264        264   Unknown                 NV          N/A
## 265        265   Unknown               <NA>          N/A
zones <- zones[1:263,]

Create 2 variables for the PU and DO service zone

data$PU_servicezone <-  data$PULocationID
data$DO_servicezone <-  data$DOLocationID

## replace the ID code with the zone in the zone dataset

repl <-  function(x) {
    for (i in 1:263) {
        x[which(x["PU_servicezone"] == i), "PU_servicezone"]  <-  as.character(zones[i,"service_zone"])
        x[which(x["DO_servicezone"] == i), "DO_servicezone"]  <-  as.character(zones[i,"service_zone"])
    }
    x }

data <-  repl(data)
   
data$PU_servicezone <-  factor(data$PU_servicezone, levels=c("Airports","Boro Zone", "EWR", "Yellow Zone"))
data$DO_servicezone <-  factor(data$DO_servicezone, levels=c("Airports","Boro Zone", "EWR", "Yellow Zone"))

Create 2 variables for the PU and DO borough

data$PU_boroughzone <-  data$PULocationID
data$DO_boroughzone <-  data$DOLocationID

## replace the ID code with the zone in the zone dataset

repl <-  function(x) {
    for (i in 1:263) {
        x[which(x["PU_boroughzone"] == i), "PU_boroughzone"]  <-  as.character(zones[i,"Borough"])
        x[which(x["DO_boroughzone"] == i), "DO_boroughzone"]  <-  as.character(zones[i,"Borough"])
    }
    x }
 
data <-  repl(data)
 
data$PU_boroughzone <-  factor(data$PU_boroughzone, levels=c("Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island"))
data$DO_boroughzone <-  factor(data$DO_boroughzone, levels=c("Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island"))

We can now plot the frequency of trips depending on their pick-up and drop-off area. The next figure shows that most trips are picked up in Manhattan and finish in the same area. A few trips that start in Manhattan finish in Brooklyn or Queens.

df <- data[,c(23,24)]

barchart(table(df), auto.key=TRUE, par.settings=myTheme,xlab="times", ylab='pick up')

There are also a few records of trips starting in Queens and Brooklyn, ending in similar numbers in Manhattan, Brooklyn or Queens. This means that trip flows are mostly within or towards Manhattan.

The same can be done by service zone, where we have information about the specific location of the airports, which could be interesting.

df <- data[,c(21,22)]

barchart(table(df), auto.key=TRUE, par.settings=myTheme,xlab="times", ylab='pick up')

From that graph we see that half of the trips from airports end in the ‘Yellow Zone’, which is mostly the Manhattan area, as seen in the figures at the beginning of the report.
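The share of drop-off service zones for each pick-up service zone can also be computed directly from the contingency table (row-wise proportions over the same df used for the barchart above):

```r
## proportion of drop-off service zones per pick-up service zone
round(prop.table(table(df), margin=1), 2)
```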

We can see these variables on a map, taking advantage of the higher resolution of the PU and DO location IDs:

Number of times that the taxi is picked up by zone

PU <- data %>% 
    group_by(PULocationID) %>%
    summarise(no_rows = length(PULocationID)) %>%
    as.data.frame()    
names(PU) <- c("LocationID","PU_times")


b <- merge(spZones, PU, by='LocationID')

## logarithmic scale because of the large difference between the airports (maximum) and the rest
pu <-  ggplot() + geom_sf(data=b, aes(fill = PU_times))+scale_fill_continuous(trans = "log10", type='viridis')

pu

Number of times that the taxi is dropped off by zone

DO <- data %>% 
    group_by(DOLocationID) %>%
    summarise(no_rows = length(DOLocationID))%>%
    as.data.frame()    
names(DO) <- c("LocationID","DO_times")

c <- merge(spZones, DO, by='LocationID')
do <-  ggplot() + geom_sf(data=c, aes(fill = DO_times))+scale_fill_continuous(trans = "log10",type='viridis')

do

As this is yellow-taxi data, most of the trips are in the Manhattan area.

2. New categorical variables:

Day-night

We can create a new categorical variable related to the time of day at which the trip is made. From the ‘hour’ variable we can detect whether it is ‘day’ or ‘night’.

data$t_puH  <-  ifelse(data$t_pu >= 7 & data$t_pu <= 20, "day", "night")
data$t_doH  <-  ifelse(data$t_do >= 7 & data$t_do <= 20, "day", "night")

data$t_puH  <-  as.factor(data$t_puH)
data$t_doH  <-  as.factor(data$t_doH)

histogram(data$t_puH, par.settings=myTheme)

Most trips are made during daytime hours. The selection of hours is somewhat arbitrary, and a different ‘day’/‘night’ classification would also be possible.

Weekday

We now also consider the day of the week as a categorical variable, in order to see whether there is a day on which people take taxis more often, or whether it influences fares, tips, etc.

data$weekday <- weekdays(data$tpep_pickup_datetime)
## weekdays() returns day names in the system locale (Spanish here), ordered Sunday to Saturday
data$weekday <- factor(data$weekday, levels=c("domingo","lunes","martes", "miércoles","jueves","viernes","sábado"))

# plot

barchart(as.factor(data$weekday), par.settings=myTheme, horizontal=FALSE,xlab='day of the week')

Although most days have a similar number of records, Sunday is clearly the day with the fewest taxi trips in New York City.

Intra trip

We create another categorical variable describing whether the trip crosses a borough boundary: trips that pick up and drop off in different boroughs are marked ‘yes’ in the ‘intratrip’ variable. The next graph shows that most trips start and end in the same borough.

data$intratrip <- ifelse(data$PU_boroughzone != data$DO_boroughzone, "yes", "no")

histogram(as.factor(data$intratrip), par.settings=myTheme, xlab='intra-trip')

4. Numeric variables and categorical variables

4.1 Boxplots for trip variables

We can represent numeric variables depending on some categorical variables like: the pick up or drop off area or the day of the week.

Relationship with the borough:

## all together:
tip <- melt(data[,c(14,23,24)], 'tip_amount')
trip  <-  melt(data[,c(5,23,24)], 'trip_distance')
fare  <-  melt(data[,c(11,23,24)], 'fare_amount')
total  <-  melt(data[,c(17,23,24)], 'total_amount')
trip_time  <-  melt(data[,c(18,23,24)], 'trip_timelength')

all  <-  cbind(trip, tip, total, fare,trip_time)
all  <-  all[,c(1,4,7,10,13,14,15)]

b  <-  melt(all)
## Using variable, value as id variables
names(b)  <-  c("pu_do","borough","variable","value")
b$pu_do <- as.character(b$pu_do)
b[(b$pu_do == "PU_boroughzone"), "pu_do"] <- 'PU'
b[(b$pu_do == "DO_boroughzone"), "pu_do"] <- 'DO'


ggplot(b, aes(x=pu_do,y=value)) +
    geom_boxplot(aes(fill=borough),outlier.size = 0.1) + scale_fill_brewer(palette="Paired") +
    labs(x=" ", y=" ") +
     facet_wrap(~ variable, scales='free', ncol=2) + theme(legend.position="top")

The figure above shows that there are few differences in the distribution of each variable between pick-up and drop-off areas. Among areas, Manhattan is the borough with the lowest fare, tip amount, trip distance and trip duration.

Relationship with day of the week

all <- data[,c(5,11,14,18,27)]
all <- melt(all, 'weekday')

ggplot(all, aes(x=weekday,y=value)) +
  geom_violin(trim=FALSE)+
    geom_boxplot(aes(fill=weekday),outlier.size = 0.1, outlier.shape = NA, width=0.3) + scale_fill_brewer(palette="Paired") +
    labs(x=" ", y=" ") +
     facet_wrap(~ variable, scales='free', ncol=2)+ theme(legend.position="top")

The figure shows that Sunday differs from the rest of the week, with a higher median trip distance. The rest of the variables, however, seem to be higher in the middle of the week. Differences are subtle in any case.

Showing the intra trip variable

trip  <-  melt(data[,c(5,28)], 'trip_distance')
tip  <-  melt(data[,c(14,28)], 'tip_amount')
fare  <-  melt(data[,c(11,28)], 'fare_amount')
total  <-  melt(data[,c(17,28)], 'total_amount')
trip_time  <-  melt(data[,c(18,28)], 'trip_timelength')

all  <-  cbind(trip, tip, total, fare, trip_time)
all  <-  all[,c(1,4,7,10,13,14,15)]

b  <-  melt(all)[,2:4]
## Using variable, value as id variables
names(b)  <-  c("intratrip","variable","value")

ggplot(b, aes(x=intratrip,y=value)) +
       geom_boxplot(aes(fill=intratrip),outlier.size = 0.1) + scale_fill_brewer(palette="Paired") +
    labs(x=" ", y=" ") +
     facet_wrap(~ variable, scales='free')

There seems to be a clear increase in all the numeric variables when the trip goes from one borough to another. However, if we take a closer look at the tip as a percentage:

## tip percentage
tip  <-  data[,c(11,14,28)]
tip$tipPer <- tip[,2]/tip[,1]*100

ggplot(tip, aes(x=intratrip,y=tipPer)) +
    labs(x=" ", y="tip % ")+
    geom_boxplot(aes(fill=intratrip),outlier.size = 0.1)+ scale_fill_brewer(palette="Paired") 

The median tip % is lower for trips between boroughs, although, as seen in the previous figure, their absolute tip amount is the largest.

Tip % and day/night time

We can also evaluate whether the tip % changes with the time the taxi is picked up. There do not seem to be important differences.

## tip percentage
tip  <-  data[,c(11,14,25)]
tip$tipPer <- tip[,2]/tip[,1]*100

ggplot(tip, aes(x=t_puH,y=tipPer)) +
  labs(x=" ", y="tip % ")+
    geom_boxplot(aes(fill=t_puH),outlier.size = 0.1)+ scale_fill_brewer(palette="Paired")

4.2 Scatterplots of the 5 relevant numeric variables: linear relationships

library(GGally)
ggpairs(data[,c(5,11,14,17,18)], aes(colour = data$t_puH, alpha = 0.4), upper = list(continuous = wrap("cor", size=2.5)),lower = list(continuous = wrap(funcVal="points", alpha = 1, size=0.5)))

The main differences are found in the trip duration: for daytime records the same distance takes longer than for night-time records. These two variables are the ones that correlate least, although the night-time data correlate better.

ggpairs(data[, c(5,11,14,17,18)], aes(colour = data$PU_servicezone, alpha = 0.4), upper = list(continuous = wrap("cor", size=2.5)),lower = list(continuous = wrap(funcVal="points", alpha = 1, size=0.5)))

We also observe that most of the airport data have a different distribution compared to the other areas. In general, trips from the airport are more expensive and take longer; additionally, the tip amount is higher. We check whether this also happens with the tip %:

df <- data
df$tipPer <- data[,14]/data[,11]*100
  
ggpairs(df[, c(5,11,29,17,18)], aes(colour = df$PU_servicezone, alpha = 0.4), upper = list(continuous = wrap("cor", size=2.5)),lower = list(continuous = wrap(funcVal="points", alpha = 0.5, size=0.5)))

In this case, the tip % is centred on 20-25% when the taxi is picked up in the Yellow or Boro Zone. For the airports we see a bimodal distribution, with a second maximum beyond 25%. However, note that data from boroughs other than Manhattan are scarce.

4.3 Mean variables by Location ID

Spatial representation of the mean tip % depending on the pick-up location ID.

data$tipPer <-data[,14]/data[,11]*100

PU_tip<- data %>% 
    group_by(PULocationID) %>%
    summarise(no_rows =  mean(tipPer)) %>%
    as.data.frame()    
names(PU_tip) <- c("LocationID","PU_mean")

DO_tip<- data %>% 
    group_by(DOLocationID) %>%
    summarise(no_rows =  mean(tipPer)) %>%
    as.data.frame()    
names(DO_tip) <- c("LocationID","DO_mean")
d <- merge(spZones, PU_tip, by='LocationID')  
pu_tip<-  ggplot() + geom_sf(data=d, aes(fill = PU_mean))+scale_fill_continuous(type='viridis')
 
e <- merge(spZones, DO_tip, by='LocationID')  
do_tip<-  ggplot() + geom_sf(data=e, aes(fill = DO_mean))+scale_fill_continuous(type='viridis')

pu_tip

Most locations are around 20-23%, but a few present values below 15% or above 25%.

The same can be shown depending on the drop-off location.

do_tip

In this case there is one location with high values relative to the others. In general there is homogeneity.

Spatial representation of the mean trip distance depending on the pick-up location ID.

PU_length  <-  data %>%
    group_by(PULocationID) %>%
    summarise(no_rows = mean(trip_distance))%>%
    as.data.frame()    
names(PU_length) <- c("LocationID","PU_length")

DO_length  <-  data %>%
    group_by(DOLocationID) %>%
    summarise(no_rows = mean(trip_distance))%>%
    as.data.frame()    
names(DO_length) <- c("LocationID","DO_length")

For the pick up location:

f  <- merge(spZones, PU_length, by='LocationID')  
pu_length  <- ggplot() + geom_sf(data=f, aes(fill = PU_length))+scale_fill_continuous(type='viridis')

g  <- merge(spZones, DO_length, by='LocationID')
do_length  <- ggplot() + geom_sf(data=g, aes(fill = DO_length))+scale_fill_continuous(type='viridis')

pu_length

The maximum mean trip length is around 8 miles; most are around 4 miles, and inside Manhattan around 2 miles.

And the drop off:

do_length

The same map, now depending on the drop-off location, shows different values. There seems to be a cluster of low values centred on Manhattan, with the largest values generally towards the margins of the figure. This makes sense considering that most trips come from the Manhattan area, so the distance increases for locations far from that borough.

5. Conclusions

Some insights can be derived from the dataset of the taxi data:

About the cleaning process:

  • Records with N/A values, non-positive fares, distances or totals, negative trip durations, fares lower than the tip, and tips outside the 10-30% range were removed.
  • Outliers in the numeric variables were filtered with the 1.5 × IQR rule.

About the exploratory analysis:

  • Most trips start and end in Manhattan; trips from the airports are longer, more expensive and better tipped in absolute terms.
  • Most tips fall between 20% and 25% of the fare.
  • There are fewer trips overnight (2-9 a.m.) and on Sundays.

MODELIZATION

1. Data preparation

The data are downloaded from the website. As in the exploratory analysis, a sample of the data is used due to computational limitations.

1.1 Cleaning

After reading the data, they are cleaned taking into account the findings of the exploratory data analysis, and some new variables are included:

In the cleaning phase:

  • Remove the N/A values of the data
    • fare > 0
    • trip distance > 0
    • extra charges >= 0
    • tolls >= 0
    • total amount > 0
    • improvement surcharge == 0.3
    • passengers >= 0

The other variables included are:

  • Duration of the trip
  • The day of the week
  • The ‘intratrip’ variable

The prepared data are stored as a .csv called taxi_model_sample_data.csv; this is done in the script model_sample.R.
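A compact sketch of what model_sample.R is described to do; the raw file name is an assumption, and the borough columns needed for the ‘intratrip’ flag are assumed to be built with the same zone lookup as in the exploratory analysis:

```r
## model_sample.R (sketch): clean the data and add the model variables
data <- na.omit(read.csv("data/raw_taxi_data.csv"))
data <- subset(data, fare_amount > 0 & trip_distance > 0 & extra >= 0 &
                     tolls_amount >= 0 & total_amount > 0 &
                     improvement_surcharge == 0.3 & passenger_count >= 0)

## trip duration in minutes
data$tpep_pickup_datetime  <- as.POSIXct(data$tpep_pickup_datetime,  tz="GMT")
data$tpep_dropoff_datetime <- as.POSIXct(data$tpep_dropoff_datetime, tz="GMT")
data$trip_timelength <- as.numeric(difftime(data$tpep_dropoff_datetime,
                                            data$tpep_pickup_datetime,
                                            units="mins"))

## day of the week and the 'intratrip' flag (PU/DO borough columns
## assumed to exist, built as in the exploratory analysis)
data$weekday   <- weekdays(data$tpep_pickup_datetime)
data$intratrip <- ifelse(data$PU_boroughzone != data$DO_boroughzone,
                         "yes", "no")

write.csv(data, "data/taxi_model_sample_data.csv", row.names=FALSE)
```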

1.2 Sample data

The dataset is then divided: 80% of the data is used as training data and the other 20% is used to test the model.

data <- read.csv("data/taxi_model_sample_data.csv")

#########################
## sample the data for modelling:

# Random sample indexes
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
test_index <- setdiff(1:nrow(data), train_index)

# Build X_train, y_train, X_test, y_test
X_train <- data[train_index, -3]
y_train <- data[train_index, "tip_amount"]

X_test <- data[test_index, -3]
y_test <- data[test_index, "tip_amount"]

## save these datasets:

write.csv(X_train, "data/xtrain.csv", row.names=FALSE)
write.csv(y_train, "data/ytrain.csv",row.names=FALSE)
write.csv(X_test, "data/xtest.csv",row.names=FALSE)
write.csv(y_test, "data/ytest.csv",row.names=FALSE)

2. Model run

Code for the model run can be found in ‘model_run.R’, but I will reproduce it here.

## Built and run the model

## As seen in the EDA there is a linear relationship among the selected numeric variables. Some of them are linearly dependent on others, which should show up in our model.

## Load the train data:

xtrain <- read.csv("data/xtrain.csv")
ytrain <- read.csv("data/ytrain.csv")

train  <- cbind(ytrain, xtrain)
names(train)  <- c("tip_amount",names(xtrain))

2.1. Multivariate regression model

We first test a multivariate linear regression model. The numerical variables included are the fare amount, the trip duration and the trip distance. Additionally, we include 2 categorical variables: the day of the week and whether the trip stays inside the same borough or not (the ‘intratrip’ variable).

Run the model

## 1.1 model: use all the variables except the total amount, which depends on the dependent variable 'tip'.

train <- train[, -4]

lm  <- lm(tip_amount~. ,data=train)
summary(lm)
## 
## Call:
## lm(formula = tip_amount ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5601 -0.0986  0.1077  0.1908  5.0410 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.131570   0.011513  11.428  < 2e-16 ***
## trip_distance    -0.008293   0.006157  -1.347 0.178022    
## fare_amount       0.199459   0.003133  63.666  < 2e-16 ***
## trip_timelength   0.001510   0.001151   1.312 0.189359    
## intratripyes      0.208733   0.007652  27.279  < 2e-16 ***
## weekdayjueves     0.028564   0.010169   2.809 0.004971 ** 
## weekdaylunes      0.041080   0.010647   3.858 0.000114 ***
## weekdaymartes     0.036796   0.010574   3.480 0.000502 ***
## weekdaymiércoles  0.032591   0.010249   3.180 0.001473 ** 
## weekdaysábado    -0.009222   0.010800  -0.854 0.393161    
## weekdayviernes    0.016534   0.010188   1.623 0.104617    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6531 on 105364 degrees of freedom
## Multiple R-squared:  0.8625, Adjusted R-squared:  0.8625 
## F-statistic: 6.611e+04 on 10 and 105364 DF,  p-value: < 2.2e-16
##confidence  <- confint(lm)
lm$coefficients
##      (Intercept)    trip_distance      fare_amount  trip_timelength 
##      0.131570301     -0.008292670      0.199458921      0.001510169 
##     intratripyes    weekdayjueves     weekdaylunes    weekdaymartes 
##      0.208733226      0.028564056      0.041079572      0.036795800 
## weekdaymiércoles    weekdaysábado   weekdayviernes 
##      0.032591283     -0.009221740      0.016533708

The training shows a good performance of the linear regression. We can check the importance of the different variables in the model summary.

Test the model

xtest  <- read.csv("data/xtest.csv")
xtest <- xtest[,-3] ## remove the total amount
ytest  <- read.csv("data/ytest.csv")

predictions  <- predict(lm, xtest, interval="prediction")
results  <- cbind(ytest, predictions)

Scatterplot of the test values against the predictions:

plot(results$x~results$fit, xlab='predictions', ylab='test', cex=0.5)
abline(0,1, col='red')

We can check the predictions, and we can also see that the residuals are approximately normally distributed.

## Normal distribution of residuals.

histogram(lm$residuals, par.settings=myTheme, xlab='residuals')

## RMSE
RMSE_lm <- sqrt(mean((results$x - results$fit)^2))
RMSE_lm
## [1] 0.6492917

However, the variance of the residuals is not constant, which means that the linear model may not be the best solution to this problem. It could be that we are missing some predictors, or that some of the data collected (those records with a constant tip regardless of the fare) cannot be used in the model.

plot(lm$residuals~lm$fitted.values, cex=0.5, xlab='fit values', ylab='residuals')
abline(c(0,0), col='red')

2.2 Random Forest

We now try a random forest to check whether it improves on the predictions of the linear model.

Run the model

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
# It takes long
rf_model <- randomForest(tip_amount ~., data = train,  importance = TRUE, ntree=300)
rf_model
## 
## Call:
##  randomForest(formula = tip_amount ~ ., data = train, importance = TRUE,      ntree = 300) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.4868762
##                     % Var explained: 84.31
importance(rf_model)[,1] ## show the importance of the variables
##   trip_distance     fare_amount trip_timelength       intratrip         weekday 
##      19.4983466      20.1764098      19.8973713      12.8809826       0.4999912

Test the model

rf_predict <- predict(rf_model, xtest)

results  <- cbind(ytest, rf_predict)

plot(results$x~results$rf_predict, xlab='predictions', ylab='test', cex=0.5)
abline(0,1, col='red')

We can compute the error and RMSE as well.

RMSE_rf <- sqrt(mean((results$x-results$rf_predict)^2))
RMSE_rf
## [1] 0.695093

The relative error can be represented. Most of the time the error is small, but a few predictions are far from the real value.

err  <- round((abs(results$x-results$rf_predict)/results$x)*100, digits=1)
histogram(err, par.settings=myTheme)

The random forest approach gives similar results to the multivariate regression model.
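To make the comparison explicit, the two RMSE values computed above can be placed side by side; with the outputs reported here this gives roughly 0.649 for the linear model against 0.695 for the random forest:

```r
## compare the two models on the test set
data.frame(model=c("linear regression", "random forest"),
           RMSE =c(RMSE_lm, RMSE_rf))
```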

FURTHER ANALYSIS

In order to go beyond this analysis: